ABSTRACT

In an emerging trend, more and more Internet users search 
for information from Community Question and Answer (CQA) 
websites, as interactive communication in such websites pro- 
vides users with a rare feeling of trust. More often than not, 
end users look for instant help when they browse the CQA 
websites for the best answers. Hence, it is imperative that 
they should be warned of any potential commercial cam- 
paigns hidden behind the answers. However, existing re- 
search focuses more on the quality of answers and does not 
meet the above need. In this paper, we develop a system 
that automatically analyzes the hidden patterns of commer- 
cial spam and raises alarms instantaneously to end users 
whenever a potential commercial campaign is detected. Our 
detection method integrates semantic analysis and posters' 
track records and utilizes the special features of CQA web- 
sites largely different from those in other types of forums 
such as microblogs or news reports. Our system is adaptive 
and accommodates new evidence uncovered by the detection 
algorithms over time. Validated with real-world trace data 
from a popular Chinese CQA website over a period of three 
months, our system shows great potential for adaptive
online detection of CQA spam.

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: General; J.4
[Social and Behavioral Science]: Sociology

General Terms 

Design, Experimentation, Measurement 

Keywords 

CQA forums; Online detection; Paid posters 

1. INTRODUCTION

Web 2.0 social websites are playing an increasingly
important role on the Internet by utilizing the wisdom of
crowds [24]. One example is the Community Question and
Answer (CQA) portals on which users can post and answer
questions, such as Yahoo! Answers, Naver and Baidu Zhidao
[4][30][20]. Some CQA websites like Quora [23] attract users
by offering professional answers, most of which come from
people whose real-world identities are verified. These
websites gain popularity and trust by providing a sense of
interaction between the questioner and the masses. With
millions of archived Q&A sessions [29], CQA forums have
become a major source of advice for many Internet users.

As a large knowledge base of crowds, the archived Q&A 
sessions have been used for automatic question answering 
and recommendation. Nevertheless, the quality of user- 
generated content in the Q&A sessions varies drastically. 
For instance, some answers do not match the questions and 
even contain spam and rude words. In recent years, tremen- 
dous efforts have been made to locate better answers and 
remove spam from the archived question-and-answer resource.
Techniques such as text analysis, user-question-answer link
relationships, and user feedback features have been used in
tools like PageRank to identify high-quality web pages
[13][15][2].

Existing techniques, however, may not work well in the 
presence of the so-called Internet water army, a large crowd 
of hidden posters who get paid to generate artificial content 
in the social media for commercial profits. Paid posters have 
become popular with the boom in crowd-sourcing marketing.
As confirmed in [26], crowd-sourcing systems such as
Amazon's Mechanical Turk and Zhu Ba Jie (a similar Chinese
crowd-sourcing site) have been broadly used for commercial
campaigns. Due to their popularity, the CQA forums
have become the targets of those campaigns that create un- 
truthful Q&A sessions for commercial purpose. Consider 
the following example: 

Question: I tried several methods to lose weight but all 
failed. What should I do? Please give me some advice! 

Best answer: Don't worry, I have experienced the same 
pain as you. Firstly, you have to keep a healthy diet. Be 
careful about the nutrition in your food and never eat fast 
food. Secondly, don't sit too long in front of a computer. 
Finally, perform physical exercise every day. What's more,
you can also try a product named X. This product contains
ingredients such as ... and can help you lose weight without
any risks.

The above Q&A session was actually generated by paid
posters. The answer provides very practical advice at first
and then recommends the product that needs to be promoted.
The practical advice is meant to earn the trust of the
users. We have observed that fake answers generated by paid
posters are often fairly long and quite relevant to the
questions, and that some paid posters involved in the fake
Q&A sessions are ranked high by the website's reputation
system.

Based on textual similarities, previous work [18] [6] [T] is 
likely to treat the above answer as of high quality due to 
the high relevance of textual features between the answer 
and question content. As a result, the output may contain 
commercial spam, resulting in a credibility problem. There- 
fore, additional strategies, such as writing templates, public 
calls for commercial campaigns, and a poster's track rep- 
utation, should be integrated for the effective detection of 
paid posters. Furthermore, most existing work relies on
offline analysis, whereas end users demand instant help and
should be warned of potential commercial campaigns while
they browse a CQA forum. There is thus a strong need for a
real-time response system that can detect potentially fake
Q&A sessions on the fly.

We tackle the above two challenges in this paper by de- 
signing an adaptive online detection system tailored specif- 
ically for CQA forums. Our contributions are as follows: 

• We discover that the behavioral features of paid posters
in CQA forums differ from those in other types of forums,
such as microblogs (e.g., Weibo, a Twitter-like service in
China) and news reports. We identify the special features
of paid posters in CQA forums that are useful for
detection.

• Based on the identified special features, we design a 
detection method which uses machine-learning tech- 
niques and assigns credibility scores to each of the best 
answers by using semantic analysis and user features, 
such as users' history data. 

• We implement an adaptive, online detection system
that automatically analyzes the hidden patterns of
commercial spam and raises alarms instantaneously to end
users whenever a potential commercial campaign is detected.
Our system is adaptive and accommodates new evidence
gathered by the detection algorithms over time.

2. DATA COLLECTION AND LABELING 

2.1 How Do Online Paid Posters Work in CQA 
Portals 

To understand the background, we start with a brief in- 
troduction on how online paid posters work in CQA sites. 

With the advent of popular crowd-sourcing websites, com- 
panies tend to hire paid posters to help them hype their 
products in the social media. Research [26] has shown that 
paid posters are capable of generating large information cas- 
cades that could escape security check and accelerate spam 
dissemination on social media, like microblogging services 
and community-based question and answer websites. 

Our research is based on Baidu Zhidao, a Chinese CQA
website that is similar to Yahoo! Answers. During our study
of the CQA-oriented promotion campaigns on crowd-sourcing
websites, we discovered specifications with detailed
requirements and templates for the paid posters. The
requirements provide not only a basic description of the
product but also the types of paid posters needed. For
example, some companies request that posters have a good
reputation. Note that many CQA websites have a reputation
system and assign high-level reputation indicators to
"trustworthy" users whose answers are mostly selected as the
best answers. Those reputation systems track the history of
users but are not designed to analyze and detect online paid
posters.

It is very interesting to notice that companies that hire 
paid posters also provide several templates for questions and 
answers. For instance, in a medicine promotion case, the 
question describes a certain symptom, and the answer ex- 
plains reasons for the symptom and recommends taking the 
specific medicine. Both question and answer templates are 
carefully crafted to sound real. The answers usually include 
personal experience with the products. In addition, the in- 
structions will advise paid posters to insert their own sen- 
tences in the templates rather than just copying and pasting 
the templates. 

Using these templates, paid posters can create complete 
Q&A sessions. They first pose a question, and use a differ- 
ent user ID to post the answer. This could be achieved by 
one user registering for multiple IDs or by several colluding 
posters. They then select the answer as the best answer, 
after waiting for other users to post answers. This waiting 
time is to cheat the readers into believing that the best an- 
swer is chosen from many answers. In the CQA portals, once 
the best answer is decided, the Q&A session is considered 
closed and no new answers can be added to the session. 

2.2 Data Collection 

Users who register on Baidu Zhidao participate in various 
Q&A sessions, either as question askers or repliers. Since we 
already know that paid posters who accept missions from 
crowd-sourcing sites create a variety of Q&A sessions on the 
site for product propaganda, the collecting process can be 
targeted directly to the product campaigns. In addition, 
since the readers tend to pay more attention to the best 
answers and also due to the manner in which online paid 
posters are supposed to work, we only collected the best 
answers and ignored other ones. This is to avoid collecting 
a large amount of irrelevant information for this study. 

In order to collect campaign Q&A sessions, we first visited
the crowd-sourcing websites, where the paid posters apply
for campaign tasks and get paid, as stated in Section 1.
After going through campaigns calling for paid posters, we
selected 11 closed requests because the paid posters who
worked for the 11 products had finished the tasks. We
extracted keywords from the 11 products and searched for Q&A
sessions with the keywords on Baidu Zhidao. We used a
crawler to collect all the links from the search results. We
then used another crawler to visit and download the web
pages associated with the links. The results included not
only the
campaign sessions, but also normal sessions containing the 
keywords. After parsing all the collected web pages, we ob- 
tained a group of target users, including both paid posters 
and normal users, as well as the links to the users' home- 
pages hosted by Baidu Zhidao. 

By following the users' homepages, we could find useful in- 
formation for our research. For example, a user's homepage 
has the question-answer history of this user, and includes all 
the Q&A sessions where this user posted his/her answers. 
The question-answer history provides good knowledge of the
multiple campaigns in which a potential paid poster might
have been involved.

Having obtained the initial dataset of IDs and links, we
then visited each user's homepage and retrieved every Q&A
session in which the user participated. We only collected
the closed Q&A sessions (i.e., those with the best answer
determined). A closed Q&A session implies that users can no
longer post new answers to the question, but they can still
click the "Like" button to support the posted answers,
including the best answer and other answers. From those Q&A
sessions, we finally extracted the information used in our
analysis. The recorded information from those web pages
includes questioner ID, answerer ID, time, title, question
content, answer content, and user feedback (visited times,
ratings).
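The recorded fields can be pictured as one record per closed Q&A session. The following is only a minimal sketch: the class name, field names, and the split of "time" into an ask time and a best-answer time are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QASession:
    """One closed Q&A session (hypothetical field names)."""
    questioner_id: str
    answerer_id: str            # poster of the best answer
    ask_time: datetime          # when the question was posted
    best_answer_time: datetime  # when the best answer was posted
    title: str
    question: str
    best_answer: str
    visits: int                 # "visited times" feedback
    rating: int                 # rating feedback


def group_by_user(sessions):
    """Build a per-user history: every session where the user asked or answered."""
    history = {}
    for s in sessions:
        # A set avoids adding a session twice if one ID plays both roles.
        for uid in {s.questioner_id, s.answerer_id}:
            history.setdefault(uid, []).append(s)
    return history
```

Grouping by user ID mirrors the per-user question-answer history lists described in this section.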

From the Q&A website Baidu Zhidao, we crawled 6462
users' question-answer history records accumulated during a 
three-month period from October to December in 2011. For 
each user, we built a list of history information, showing the 
question, answer, participated user IDs, and other features. 
Associated with the 6462 user IDs, we have 75,200 Q&A 
sessions in total, all having the best answer. 

2.3 Manual Data Labeling 

To get a sample dataset for feature analysis, campaign 
sessions should be differentiated from the normal ones. By 
reading the best answers, we manually labeled the Q&A 
sessions in the dataset. The labeling process mainly depends 
on the Q&A templates from the crowd-sourcing websites 
such as Zhubajie [31] and Tiancaicheng [25]. We summarize
the applied techniques below: 

1. Since we have collected a list of 11 products which 
were hyped in the Baidu Zhidao, we could compare 
the Q&A content with the campaign templates. If the 
product's name is in the 11 initial samples and the 
contents match the templates, such as the descriptive 
words and the organized pattern of sentences, we labeled
it as a campaign Q&A session. We stress that there is a
difference between our work and related research that
needs to judge the quality of answers.
The evaluation of quality of answers is usually based on 
question-answer relevance, length of the texts, gram- 
mar correctness, politeness, and so on. To obtain a 
reliable dataset, researchers often rely on multiple as- 
sessors and are faced with the difficulty of reaching 
an agreement among the multiple evaluation results. 
Our labeling method differs from the above and largely 
avoids the annotation difficulty, because we know ex- 
actly the name of the hyped product and how paid 
posters would write the Q&A sessions. 

2. When we encountered new products not in the list of 
11 initial samples, we recorded the product's name and 
searched for it on the crowd-sourcing websites. If we found
the template of this product, we used the above method
to compare their contents.

3. If a new product was listed on the campaign websites
but the template was not available, we relied on special
features normally found in Email spam to make a decision.
For example, a spam message may use different fonts to
write telephone numbers and insert special characters
within the product's name. This type of operation is
usually used to escape detection by filter systems. We
labeled the session as a campaign if the product's name was
in a campaign list and the best answer had special features
similar to Email spam.

4. If we could not find the new product on the campaign
websites, we then tried to identify potential templates
used in the same category of products and special features
typical of Email spam. If none of those could be
identified, we labeled the session as a normal session.


Up to now, we have labeled 4998 samples in our dataset. 
Among these, 2147 samples are campaign Q&A sessions and 
the other 2851 samples are normal ones. The sample size is 
large enough for our current study. Since we selected 11 cam- 
paigns, which were posted on the crowdsourcing websites, as 
the seeds of our crawler, the proportion of campaign sessions 
is relatively high in the dataset. 

When we manually labeled our datasets, we carefully read
the contents of a user's post. The meaning can be understood
by humans but is hard to use in machine-learning-based
classification. Even with the above template-based labeling
method, it is not easy to write an algorithm to automatically 
identify a campaign session because a poster may re-phrase 
the template in their own words. Due to these reasons, we 
need to search for statistical features that can be effectively 
used towards building a detection system. 

3. ANALYSIS OF STATISTICAL FEATURES

3.1 Insufficiency of Existing Statistical Features

We first demonstrate the difficulty of the problem we study
by analyzing existing features and showing their
limitations. Some of these features have been used in
related research, such as the evaluation of high-quality
answers or the detection of the Internet water army in news
report websites [8].

3.1.1 Interval Post Time 

In [19], Arjun et al. defined several spamming indicators
for modelling the behaviour of fake review writers. They 
found that spammers of a spam group tend to post reviews 
during a short time interval. This feature has been shown to 
be a good indicator to detect Internet water army in news 
report websites [8]. 

In our work, we consider two time stamps for a Q&A session:
one is the time when the questioner posts the question
topic (the ask time), and the other is the time when the
best answer is posted by a replier (the best answer posted
time). We define the interval post time as the latter time
stamp minus the former one.
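As a minimal sketch of this definition, the interval post time can be computed from the two time stamps as follows. The function name is hypothetical, and expressing the result in seconds is an assumption based on the axis label of Figure 1.

```python
from datetime import datetime


def interval_post_time(ask_time: datetime, best_answer_time: datetime) -> float:
    """Best-answer posted time minus ask time, in seconds."""
    return (best_answer_time - ask_time).total_seconds()


# Example: a best answer posted 30 minutes after the question.
example = interval_post_time(datetime(2011, 10, 1, 12, 0, 0),
                             datetime(2011, 10, 1, 12, 30, 0))
```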

In Figure 1, we show the probability distribution of the
interval post time, with "pp" (abbreviation of paid posters)
for campaign sessions and "nu" (abbreviation of normal
users) for non-campaign sessions. The x-axis uses a lg
(base-10 logarithmic) scale.

From the figure, we find it difficult to tell the difference
between campaign and non-campaign Q&A sessions. Two reasons
may contribute to this phenomenon. First, many normal users
spend much time on the Q&A website and try to post answers
to open questions, especially those questions associated
with reward points. These people are known as bounty
hunters. Most bounty hunters post very good answers because
they want to earn more reward points. Second, before they
post and choose the best answer, online paid posters
normally wait for some random time so that other answers
appear in the session. This gives readers the false
impression that the best answer is selected from among many
answers. While paid posters try to finish a job as quickly
as possible in news report websites [8], the same behaviour
does not exist here.

Figure 1: The PDF and CDF of the interval post time
(x-axis: interval post time in seconds)

3.1.2 Number of Other Answers 

Before the question is closed, users can post their own 
answers. This variable counts the number of answers other 
than the best one. Intuitively, if the paid posters create the 
sessions themselves, they may not have patience to wait for 
more replies. They could close the sessions and get paid 
as soon as possible. To test this conjecture, we show the 
probability distribution of this feature for campaign sessions 
and normal sessions in Figure 2.

Similar to the interval post time, the number of other 
answers does not indicate much difference for the two types 
of Q&A sessions. This invalidates the above conjecture and 
we do not consider it a good feature for the detection of paid 
posters in CQA portals. 

3.1.3 Number of Likes 

Similar to the "Like" button in Facebook, if other read- 
ers find the best answer to be helpful, they may click the 
"like" button. The number on the button indicates the total 
number of clicks. Intuitively, this feature represents user
feedback and should be helpful in identifying trustworthy
answers. The more "likes" an answer receives, the more
likely it is a good answer. However, as shown in Figure 3,
this is
not a reliable feature. This is because the paid posters could 
click the button themselves and even use different user IDs 
to click multiple times. This behavior is also confirmed in [5] 
as the "vote spam attack". 

Figure 2: The PMF and CDF of the number of other answers

Figure 3: The PMF and CDF of the number of likes

3.1.4 Relevance between Questions and the Best Answers

This feature has been used extensively in identifying
high-quality answers [18, 6, 2, 7, 22]. The previous work is
usually based on the following assumptions:

1. Semantically high relevance between questions and an- 
swers indicates high quality. 

2. Selected best answers should have higher quality than 
other answers. 

The above assumptions are risky for the detection of
potential campaigns created by paid posters. In commercial
campaigns, answers of high quality are rather misleading
and would defeat the retrieval mechanism. Many of the
answers are well organized and highly related to the
questions. In this sense, a "high-quality" answer is not
necessarily trustworthy. Thus, we do not consider the
relevance measure in our work.



3.2 Special Features for CQA Portals 

The limitations of existing statistical features shown above 
led us to look for new features specific to users in CQA 
websites. 

3.2.1 Spam Grade of Questioner ID (SGqID)

It indicates whether the questioner tends to ask campaign
questions. For a given questioner ID (qID), we calculate
the ratio of the number of campaign sessions to the total
number of sessions in which the user has participated,

$$\mathrm{SGqID} = \frac{q_1}{q_0 + q_1} \tag{1}$$

where $q_0$ and $q_1$ are the numbers of non-campaign and
campaign sessions in which the user appears as the
questioner, respectively. If the system does not record
such information, we set the SGqID value to 0.5 (see the
footnote on the Maximum Entropy Principle).

3.2.2 Spam Grade of Answerer ID (SGaID)

It indicates whether the best answer poster tends to write
campaign answers. For a given answerer ID (aID), we
calculate the ratio of the number of campaign sessions to
the total number of sessions in which the user has
participated,

$$\mathrm{SGaID} = \frac{a_1}{a_0 + a_1} \tag{2}$$

where $a_0$ and $a_1$ are the numbers of non-campaign and
campaign sessions in which the user appears as the poster
of the best answer, respectively. Similar to SGqID, if the
system does not record such information, we set the SGaID
value to 0.5.
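The two ID-based spam grades share the same ratio and the same 0.5 default, so they can be sketched with one helper. This is an illustrative sketch only: the function names and the input format (tuples of questioner ID, answerer ID, campaign label) are assumptions, not the authors' implementation.

```python
def spam_grade(non_campaign: int, campaign: int, default: float = 0.5) -> float:
    """Ratio of campaign sessions to all sessions; default when no history."""
    total = non_campaign + campaign
    if total == 0:
        return default  # no recorded sessions for this ID
    return campaign / total


def id_spam_grades(labeled_sessions):
    """labeled_sessions: iterable of (questioner_id, answerer_id, is_campaign).

    Returns two dicts mapping user ID to SGqID and SGaID respectively.
    """
    q_counts, a_counts = {}, {}
    for qid, aid, is_campaign in labeled_sessions:
        # counts[0] = non-campaign sessions, counts[1] = campaign sessions
        q_counts.setdefault(qid, [0, 0])[int(is_campaign)] += 1
        a_counts.setdefault(aid, [0, 0])[int(is_campaign)] += 1
    sg_q = {uid: spam_grade(c[0], c[1]) for uid, c in q_counts.items()}
    sg_a = {uid: spam_grade(c[0], c[1]) for uid, c in a_counts.items()}
    return sg_q, sg_a
```

An ID that never appears in the labeled data is simply absent from the returned dictionaries; a caller would then fall back to `spam_grade(0, 0)`, which yields the 0.5 default.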

3.2.3 Spam Grade of the Text (SGtext) 

It indicates whether the collection of words in sessions
associated with a user tends to be campaign specific. To
calculate this feature, we need to perform statistical
analysis over the words. The text information of a Q&A
session consists of the title, the content of the question,
and the content of the best answer. We remove the duplicate
words so that we get a collection of distinct words,
$word_1, word_2, \ldots, word_n$, for each Q&A session. For
each word, we calculate a spam grade that characterizes the
property of the word, i.e., whether it is more campaign
oriented or non-campaign oriented. Words with a higher spam
grade are more likely to imply hidden promotion behavior. To
remove the impact of text length, we take the average of the
spam grades of all words as the spam grade of the whole
text. For each word, the spam grade is defined as:



$$\mathrm{SGword}_i = \lg\left(\frac{N + 1}{n_i + 1} \cdot \frac{s_i + 1}{S + 1}\right) \tag{3}$$

where $N$ and $S$ are the total numbers of non-campaign and
campaign sessions in the database, and $n_i$ and $s_i$ are
the numbers of non-campaign and campaign sessions in which
$word_i$ appears. The term $+1$ is used to normalize the
result in case of zero counts. Then the spam grade of a text
with $L$ distinct words is calculated as:



$$\mathrm{SGtext} = \frac{\mathrm{SGword}_1 + \mathrm{SGword}_2 + \cdots + \mathrm{SGword}_L}{L} \tag{4}$$

Footnote: The decision to use 0.5 as the default spam grade
follows the Maximum Entropy Principle, i.e., we should "make
use of all the information that is given and scrupulously
avoid making assumptions about information that is not
available."
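To make the text feature concrete, the sketch below computes a per-word spam grade and averages it over the distinct words of a session. It assumes that Equation (3) has the add-one smoothed log-ratio form lg(((N + 1)/(n_i + 1)) * ((s_i + 1)/(S + 1))); the function names and the dictionary-based session counts are hypothetical.

```python
import math


def word_spam_grade(n_i: int, s_i: int, N: int, S: int) -> float:
    """Assumed form of Eq. (3): positive for campaign-oriented words."""
    return math.log10(((N + 1) / (n_i + 1)) * ((s_i + 1) / (S + 1)))


def text_spam_grade(words, non_campaign_count, campaign_count, N, S) -> float:
    """Eq. (4): average word spam grade over the distinct words of a session.

    non_campaign_count / campaign_count map a word to the number of
    non-campaign / campaign sessions that contain it.
    """
    distinct = set(words)  # duplicates are removed first
    if not distinct:
        return 0.0         # assumption: empty text is treated as neutral
    total = sum(
        word_spam_grade(non_campaign_count.get(w, 0), campaign_count.get(w, 0), N, S)
        for w in distinct
    )
    return total / len(distinct)
```

With 9 sessions of each type, a word seen in all campaign sessions and no normal ones gets grade 1, and the mirror-image word gets grade -1, so a text containing both averages to 0.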



3.3 Property of the Feature Set 

Figure 4 shows the values of the three "SG" features
defined in the previous section.

Figure 4: 4998 samples captured by SGqID, SGaID
and SGtext

Through this figure, we can observe a clear gap between 
the campaign sessions and non-campaign sessions. We can 
then apply regression based techniques to calculate the cam- 
paign score, which indicates whether a Q&A session tends 
to be a campaign. 

4. DETECTION METHOD 

In this section, we introduce a logistic regression
approach [10][1] to calculate campaign scores for Q&A
sessions using the three proposed "SG" features.

4.1 The Algorithm 

Figure 4 has already shown that the samples can be
distinguished by the three proposed features, SGqID, SGaID
and SGtext. In order to get a score indicating whether a Q&A
session is a potential commercial campaign or not, we apply
logistic regression as the learning method. We can use it to
calculate the values of $P(Y = 1 \mid X, \theta)$ and
$P(Y = 0 \mid X, \theta)$. Here, $Y$ is an indicator
variable, where $Y = 1$ and $Y = 0$ represent campaign and
non-campaign Q&A sessions, respectively. $X$ is a vector of
the three features of each session. $\theta$ is a vector of
model parameters, one associated with each session feature
plus a constant term that is not related to the session
features.

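By applying the sigmoid function, the hypothesis
$h_\theta(X)$, which outputs the score
$P(Y = 1 \mid X, \theta)$ (termed the campaign score; the
complementary probability is
$P(Y = 0 \mid X, \theta) = 1 - h_\theta(X)$), is defined as
follows:

$$h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} \tag{5}$$

where $\theta^T X = \theta_1 + \theta_2 \cdot \mathrm{SGqID}
+ \theta_3 \cdot \mathrm{SGaID} + \theta_4 \cdot
\mathrm{SGtext}$. To facilitate the matrix calculation, we
add an all-1 column to $X$.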

In practice, the higher the score, the higher the
probability that the given session is a campaign session.
The values of $\theta$ will be learned by logistic
regression. The objective then becomes a regression problem
in which we optimize the model so that the output campaign
scores of sessions are close to their true labels (0 or 1).



The convex cost function of this optimization problem is
given by

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{6}$$

where $m$ is the number of samples in the training dataset
and $x$ is a matrix consisting of the $m$ feature vectors of
the training samples. We use the gradient descent method to
find the minimum of the cost function and the corresponding
values of $\theta$.
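The training procedure can be illustrated with a small, dependency-free sketch of batch gradient descent on the cross-entropy cost. The learning rate, iteration count, function names, and the toy feature values are all assumptions made for illustration; the paper does not report its optimizer settings.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def campaign_score(theta, features) -> float:
    """h_theta(X): theta[0] is the constant term, theta[1:] weight the features."""
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    return sigmoid(z)


def cost(theta, X, y) -> float:
    """Average cross-entropy over the training samples."""
    eps = 1e-12  # guards log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(campaign_score(theta, xi), eps), 1.0 - eps)
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / len(X)


def train(X, y, lr: float = 0.5, iterations: int = 2000):
    """Batch gradient descent; lr and iterations are illustrative choices."""
    theta = [0.0] * (len(X[0]) + 1)
    m = len(X)
    for _ in range(iterations):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = campaign_score(theta, xi) - yi
            grad[0] += err                      # gradient of the constant term
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        theta = [t - lr * g / m for t, g in zip(theta, grad)]
    return theta


# Toy (SGqID, SGaID, SGtext) vectors, invented for illustration only.
X_toy = [(0.9, 0.8, 0.6), (1.0, 0.9, 0.4), (0.1, 0.0, -0.5), (0.0, 0.2, -0.3)]
y_toy = [1, 1, 0, 0]
theta_toy = train(X_toy, y_toy)
```

On this separable toy set the learned scores fall on the correct side of 0.5 and the cost drops below its starting value of ln 2.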

4.2 Regression and Classification Results 

We shuffled the 4998 labelled samples and took 3500 of
them as the training set and the remaining 1498 as the test
set. The statistics of the two datasets are shown in
Table 1. Note that the split of the dataset is arbitrary and
only serves to illustrate our detection method and results.
We will demonstrate how our system adapts to data changes
for real-time detection in the next section.

Figure 6: ROC curve of the classification result with
different threshold values



Table 1: The number of samples of training and test
datasets

            non-campaign    campaign
training    1984            1516
test        867             631

Once $\theta$ is optimized, we calculate the campaign
score of each Q&A session in the test dataset. The
distribution of scores for normal sessions and campaign
sessions is shown in Figure 5.

Figure 5: Scores of test set (CDF), for non-campaign and
campaign sessions

From the figure, we can see that the two types of Q&A 
sessions exhibit great difference on the distribution of the 
campaign scores. Most of the campaign scores are very close 
to their true labels (Y = 1 or Y = 0). Using the scores, we 
can either provide the raw scores to the users to help them 
make decisions when reading the answers, or we can assign 
the labels based on a threshold value, i.e., Y = 1 when the 
campaign score is larger than the threshold value and Y = 0
otherwise.



In Figure 6, we show the ROC curve based on different
threshold values, 0.1, 0.2, ..., 0.9. The points on the
curve are mostly located at the top-left corner. The reason
is that the campaign scores of most campaign sessions are
higher than 0.9 while the campaign scores of most normal
sessions are smaller than 0.1. This curve shows that the
system performance is robust over a large range of threshold
values.
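The ROC points can be obtained by sweeping the threshold over the campaign scores. The sketch below is illustrative only: the function name is hypothetical and the example scores are invented, not taken from the paper's data.

```python
DEFAULT_THRESHOLDS = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9


def roc_points(scores, labels, thresholds=DEFAULT_THRESHOLDS):
    """Return (threshold, false positive rate, true positive rate) triples.

    A session is predicted as a campaign when its score exceeds the threshold.
    Assumes both classes are present in `labels`.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        points.append((t, fp / negatives, tp / positives))
    return points


# Invented scores: three campaign sessions followed by three normal ones.
toy_scores = [0.95, 0.92, 0.55, 0.05, 0.03, 0.45]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_curve = roc_points(toy_scores, toy_labels)
```

When most campaign scores sit above 0.9 and most normal scores below 0.1, nearly every threshold in this range yields the same point close to (0, 1), which is why the curve clusters at the top-left corner.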

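Based on Figure 5 and Figure 6, we set 0.5 as our current
threshold. With this threshold, we evaluate the following
four performance metrics:

$$\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}}$$

$$\mathrm{Recall} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}}$$

$$\mathrm{F\text{-}measure} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{Accuracy} = \frac{\mathrm{TrueNegative} + \mathrm{TruePositive}}{\mathrm{Total\ Number\ of\ Users}}$$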
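The four metrics follow directly from the confusion counts, as the sketch below shows. The paper does not report raw counts; the example counts are hypothetical values chosen only because they agree with the test-set sizes in Table 1 and reproduce the percentages in Table 2. The accuracy denominator is taken here to be the total number of test samples.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (precision, recall, f_measure, accuracy) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy


# Hypothetical counts: 631 campaign and 867 non-campaign test sessions.
precision, recall, f_measure, accuracy = classification_metrics(
    tp=629, fp=7, fn=2, tn=860
)
```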

The classification results are shown in Table 2. Based on
the performance results in the table, we can see that the
proposed features for detecting campaign sessions are valid
and effective. Though the performance of our offline
analysis is very satisfactory, we will test it on other and
bigger datasets to further validate our feature set in the
future. In the next section, we introduce an adaptive online
detection system that adaptively learns data changes over
time and returns detection results in real time.



Table 2: Performance Results

Precision    Recall    F-measure    Accuracy
98.90%       99.68%    99.29%       99.40%



5. ADAPTIVE ONLINE DETECTION SYSTEM

In the previous section, we have shown that we can build
a model to effectively calculate the campaign score and
predict the labels of unknown sessions. In practice,
however, this offline analysis does not work well for users
who would like to be advised of potential campaigns in real
time. This requirement encourages us to design an online
version of the detection system, which can return campaign
scores and/or predicted results in real time. We therefore
build a prototype of such an adaptive online detection
system. The word "adaptive" implies that this system can
update its database using new samples and generate new model
parameters.

5.1 Overview of System Design 

The major components of the detection system are a
browser plugin and a remote server. Figure 7 shows the
system architecture and the communication between the client
plugin and the server.

As shown in Figure 7, the sequence of actions that takes
place when a user opens a Q&A session is:

1. The plugin first sends only the URL of the page to the
server. The server searches for the URL in its database.
If it is found, the server returns the score (spam rating)
to the client. The client-side script displays the result.
This avoids unnecessarily sending the complete web page
to the server if it is already present in the database.

2. If the URL is not present, the server sends a "not
found" response. After receiving the response, the client
sends the rest of the data to the server through another
XMLHttpRequest and waits for the server's response.

3. The server receives the data, segments the text into
words, and stores it in the database. The server then
extracts the statistical features necessary for the
analysis from the data. Logistic regression analysis is
performed to predict the class of the session (spam or not
spam). If the session is classified as spam, an alert
is returned to the user.

4. The client-side script displays the result to the user.

5. (Optional) If the user is an authorized user, the user
can provide feedback to the server (whether or not
he/she feels the session is a campaign session). There
are three types of users in the system: regular users
are those who use our system but are not granted
the right to annotate sessions; helper users are those
who have experience and are capable of helping label
the data; the administrator is the person responsible
for the management of the system. Note that helpers
could be contracted out to employees of professional
companies such as Rediff Shopping and eBay [19].

6. When newly labelled sessions are available, the system
updates the detection model using existing and newly
labelled data. Note that this step could be done
regularly on a daily or even weekly basis.
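The request flow above can be sketched as a small in-memory model of the server side. This is not the authors' implementation: the class and method names are hypothetical, a dictionary stands in for the database, the feature extraction is passed in as a callable, and the role names simply mirror the three user types in step 5.

```python
import math


class DetectionServer:
    """In-memory stand-in for the remote server (illustrative only)."""

    def __init__(self, theta, threshold: float = 0.5):
        self.theta = theta          # learned logistic regression parameters
        self.threshold = threshold
        self.scores = {}            # URL -> campaign score ("database")
        self.labelled = []          # feedback collected for model updates

    def lookup(self, url):
        """Step 1: return the stored score, or None for 'not found'."""
        return self.scores.get(url)

    def analyze(self, url, features):
        """Step 3: score a new session from its (SGqID, SGaID, SGtext) features."""
        z = self.theta[0] + sum(t * x for t, x in zip(self.theta[1:], features))
        score = 1.0 / (1.0 + math.exp(-z))
        self.scores[url] = score
        return score

    def feedback(self, url, is_campaign: bool, role: str) -> bool:
        """Step 5: only helpers and the administrator may annotate sessions."""
        if role not in ("helper", "administrator"):
            return False
        self.labelled.append((url, is_campaign))
        return True


def handle_page(server, url, extract_features):
    """Client-side flow: send the URL first, send full data only if needed."""
    score = server.lookup(url)
    cached = score is not None
    if not cached:
        # Step 2: the page content is extracted and sent only on a miss.
        score = server.analyze(url, extract_features())
    return {"score": score, "alert": score > server.threshold, "cached": cached}
```

Sending the URL first means that `extract_features` is never called for a page the server has already scored, which is the saving described in step 1.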

5.2 Plugin Design 

The plugin is a Google Chrome extension and must be
installed in the Chrome browser on the user's system. The
plugin consists of a manifest.json file, an HTML file and a
contentscript.js file. The contentscript.js file specifies
the JavaScript to be executed on the webpage the user is
browsing. The manifest.json file contains the name and
version of the plugin and the HTML and script files
associated with the plugin. The manifest file also contains
a list of permissions that the plugin might use to access
servers.



The functions of the plugin can be separated into three 
major steps. 

1. Extract data from the web page. All the required data
are extracted from the HTML source of the web page.
Separate JavaScript functions were written to extract
each kind of information. The information extracted
includes the page URL, Question, Questioner Name,
Questioner URL, Time of posting question, Question
Category, Best Answer, Answerer Name, Answerer URL, Time
of posting answer, and Rating of the answer. All the
functions are written in the contentscript.js file.

2. Send data to the server. The server processes the data
and returns the result. The client-side JavaScript
communicates with the server by sending an
XMLHttpRequest. The POST method is used because the data
to be sent may be too large for the GET method. Also,
for data extracted from the zhidao.baidu.com website,
the encoding of the data is set to gb2312 in order to
encode Chinese characters.

3. The result is displayed to the user. If the user is an 
authorized user, the user can enter his/her feedback to 
the server. 
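
The encoding choice in step 2 can be illustrated with a short
Python sketch. The plugin itself is written in JavaScript, so
this only shows the principle: a form body is percent-encoded
from gb2312 bytes, and the receiver must decode with the same
charset. The field names and the example URL are hypothetical;
the paper lists the extracted fields but not the wire format.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical field names and values, for illustration only.
session = {
    "url": "http://zhidao.baidu.com/question/123.html",
    "question": "哪个手机好",
    "best_answer": "推荐这个牌子",
}

# Percent-encode the POST body using GB2312 bytes instead of UTF-8.
body = urlencode(session, encoding="gb2312")

# The receiver recovers the text only if it decodes with the
# same charset that the sender used.
decoded = {key: values[0]
           for key, values in parse_qs(body, encoding="gb2312").items()}
```

Decoding the same body as UTF-8 would not reproduce the
original Chinese text, which is why the charset has to be
agreed on by both sides.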

5.3 Server Design 

The server communicates with the plugin and also maintains
a database system. The database system stores the
information of Q&A sessions and the prediction model. The
server receives the Q&A session data sent from the browser
plugin. If the database already has a label for the
session, the server returns the label. If it is a new
session, the server stores it in a buffer, calculates the
spam grade based on the current model parameters, and
returns the spam grade if necessary (i.e., a campaign
session is detected). When enough data has been collected,
we can use the helpers to label the data. The detection
model is then updated by logistic regression on the
previous data together with the newly labelled data.
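
A minimal sketch of the scoring and buffering described above
is given below. It assumes that the spam grade is the standard
logistic function applied to a weighted sum of session
features; the feature names, the function spam_grade, and the
class SessionBuffer are hypothetical and are not taken from
the system itself.

```python
import math
from typing import Dict, List, Tuple


def spam_grade(weights: Dict[str, float], bias: float,
               features: Dict[str, float]) -> float:
    """Logistic-regression score in (0, 1) for one Q&A session."""
    z = bias + sum(w * features.get(name, 0.0)
                   for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


class SessionBuffer:
    """Holds unlabelled sessions until a batch is ready for labelling."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.pending: List[Tuple[str, Dict[str, float]]] = []

    def add(self, url: str, features: Dict[str, float]) -> bool:
        # Returns True once enough sessions have accumulated.
        self.pending.append((url, features))
        return len(self.pending) >= self.batch_size

    def drain(self) -> List[Tuple[str, Dict[str, float]]]:
        # Hand the batch to the helpers and start a new buffer.
        batch, self.pending = self.pending, []
        return batch
```

With all weights and the bias at zero the grade is 0.5; the
threshold above which an alert is raised is a separate design
choice that this sketch leaves open.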

5.4 Evaluation of Adaptive Online Detection 
System 

To evaluate the performance of the adaptive online detection
system, we use the data collected from Baidu Zhidao and
replay the data in iterations to simulate a real-world
scenario. In particular, we pretend that initially we have
only partial data and use it as the training dataset to
build a detection model. In each iteration, we add some new
sessions and use them as the test dataset to test the
performance of the detection system. At the end of an
iteration, the new sessions are added to the training
dataset, and the detection model is updated using the new
training dataset. This step corresponds to the scenario in
which new data are labelled and added to the system. Then
we repeat with another iteration.

For the test, we begin with a 500-sample training set and
build an initial detection model. At each iteration, we add
a 200-sample test set. After evaluating the detection
performance, we expand the training dataset with the 200
test samples and update the detection model with the new
training dataset. We repeat this until we use up all 4998
samples.
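
The schedule above can be written down as a small generator
of index ranges. This is an illustrative sketch: the function
name replay_splits is hypothetical, and it assumes that the
final iteration simply uses whatever samples remain, which
the description does not state explicitly.

```python
from typing import Iterator, Tuple


def replay_splits(n_total: int, n_initial: int,
                  n_step: int) -> Iterator[Tuple[range, range]]:
    """Yield (training indices, test indices) for each iteration."""
    start = n_initial
    while start < n_total:
        end = min(start + n_step, n_total)
        # Train on everything seen so far, test on the next batch.
        yield range(0, start), range(start, end)
        # The tested batch joins the training set for the next round.
        start = end


splits = list(replay_splits(n_total=4998, n_initial=500, n_step=200))
```

Under that assumption, 4998 samples with a 500-sample start
and 200-sample steps give 23 iterations, the last of which
tests on the remaining 98 samples.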

Figure 7: System architecture and communication between the client and the server

Figure 8 and Figure 9 show the update of model parameters
and the detection performance in each iteration. We can
observe that the detection model tends to converge after
enough sessions have been added to the database over
several iterations. This test scenario is similar to the
practical application, where we predict the labels of
unknown sessions using current knowledge and train a new
model on those sessions after we manually label them. The
good performance results, similar to those presented in
Section 4.2, indicate that our system can effectively adapt
model parameters to achieve good performance.

Figure 8: Adaptive changes of model parameters
over time



Figure 9: The performance of the online detection 
system over time 



To illustrate the advantage of adaptiveness, we also
perform another test in which we fix the model after it is
trained on the initial dataset. We use 200 samples as the
initial training data and build a model. We fix the model
parameters, and at each iteration, we test 200 new sessions
using the fixed model. The results are shown in Figure 10.

Figure 10: The performance of the fixed model

Compared with Figure 9, the fixed model shows nearly perfect
precision but degraded performance on the remaining metrics.
We further examined the values of TP, TN, FP, and FN and
found that the number of false negatives was very high in
the non-adaptive model, while the number of false positives
was very low. This means that the non-adaptive model
classified many campaign Q&A sessions as non-campaign
sessions. Consequently, although the precision is high, the
decline in the other metrics implies that the non-adaptive
model has an obvious bias in classification. Worse, this
model cannot update itself with new samples because its
parameters are trained only on the initial training dataset.
Therefore, making the prediction model adaptive to new
samples is a necessary objective of the system design.
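
The effect described above, high precision alongside weak
results elsewhere, follows directly from the standard metric
definitions. The sketch below uses made-up counts chosen only
to illustrate the pattern; they are not measurements from
this evaluation, and the choice of accuracy, recall, and F1
as the other metrics is an assumption.

```python
from typing import Dict


def metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts: no false positives, many false negatives.
example = metrics(tp=20, tn=120, fp=0, fn=60)
```

With these illustrative counts the precision is 1.0 while the
recall is only 0.25, because precision ignores the campaign
sessions that the model failed to flag.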

6. RELATED WORK 

Our research is most closely related to work on spam
detection and on recognizing experts, authoritative users,
and trustworthy content in social media. These topics have
become crucial to many online services, especially question
and answer communities, whose contents are generated by
millions of users. We discuss prior work on two aspects.

6.1 Retrieving high-quality answers in CQA 
sites 

A lot of research has been done on finding high-quality
content in CQA sites. However, we have not seen any paper
that explicitly addresses the credibility problem introduced
in our work. Usually, researchers treat the best answers as
the high-quality answers, an assumption that paid posters
can defeat. In our work, we explicitly consider the
credibility of the best answers.

Jeon et al.
in a community based question answering service with only
non-textual features, such as Answerer's Acceptance Ratio,
Answer Length and User's Recommendation. They assumed
the user feedback was a reliable source for the evaluation.
Jurczyk et al. [15] presented a study of the link structure
of Yahoo! Answers [30]. They adopted an adaptation of the
HITS algorithm [17] for finding experts in the Q&A portal.
Their research was also based on the assumption that the
user feedback could be used to assign weights to the edges
of their graph representing user relationships.

Liu et al. [18] applied their automated summary technique 
to summarize answers for questions which ask for opinions. 
They used cosine similarity to cluster topic-oriented answers 
and eliminated irrelevant ones. Bian et al. [6] tried to use 
both relevance between questions and answers and the qual- 
ity of answers to retrieve good answers for a user query. Both 
textual features and statistical features such as user ratings 
were used in their approach. Later, in another work by Bian 
et al. [5] , they explicitly considered the effect of several vote 
spam attacks. Such activities involved malicious voting for 
specific answers to improve their ranking and to decrease 
the ranking of competitors at the same time. 

Agichtein et al. [2] studied the basic elements of social
media and combined three features of the social media
(Yahoo! Answers) to facilitate the task of identifying
high-quality content, namely intrinsic content quality,
interactions between users, and content usage statistics.
The HITS and PageRank algorithms were used to calculate
the hub and authority scores of users, and usage statistics
such as the number of clicks on the Q&A session were used
to complement the link-based analysis.

Fichman [9] conducted a comparative study of answer
quality on multiple Q&A websites: Yahoo! Answers, Wiki
Answers [27], Askville [3], and the Wikipedia Reference
Desk [28]. Accuracy, completeness, and verifiability were
used as the quality measures for cross-platform comparison.
Fichman found that the quality of answers was significantly
improved only in terms of answer completeness and
verifiability, not answer accuracy.

6.2 Other research work about crowd-sourcing 
spams but in different realms 

Previous research has also investigated crowd-sourcing
spam in other areas. Jindal et al. [14], Ott et al. [21] and
Arjun et al. [19] attempted to detect fake reviews or
opinion spam in online shopping stores, such as Amazon's
online store. Similar to research on CQA websites, they
also used textual similarity features and user-oriented
features, such as ratings and history records. Huang et al.
[12] developed a regression model with features suggesting
quality-biased short text in the microblogging service
Twitter. They judged the quality of tweets based on the
relevance, informativeness, readability, and politeness of
the short content and assigned scores from 1 to 5. However,
they did not explicitly present how they define a spam-like
tweet.

Huang conducted a similar study of commercial spam on
blogging sites. They showed that the propaganda of some
products in the comments of a blog post was crucial in
detecting the malicious comments. The propaganda appeared
in the form of URLs, phone numbers, e-mail addresses, MSN
numbers, etc.

7. CONCLUSIONS AND FUTURE WORK 

Detection of hidden campaigns can improve the user's
experience when using current social websites. In this
paper, we disclose the behavior of a specific group of
online paid posters who create commercial campaigns on
community Q&A websites. We collect real-world datasets and
identify effective features to distinguish normal sessions
from campaigns. The performance of our classifier, with
integrated statistical and semantic analysis, is quite
promising in the real-world case study. Based on a learning
technique, we also implement a prototype of an adaptive
online detection system that can retrieve the result in
real time. The campaign scores and/or predicted labels can
help users make better decisions when searching for answers
on CQA portals and help the questioners select better
answers as well.

This work is our first effort to detect online paid posters
on CQA websites. In the future, we will test more features
to improve the adaptive performance.